Red Wine Quality Analysis

library(ggplot2)
library(knitr)
library(tidyr)
library(psych)
library(dplyr)
theme_set(theme_classic())
wine <- read.csv('wineQualityReds.csv')

Summary

In this brief, we explore the Red Wine Quality dataset from P. Cortez et al., available through UCI’s ML repository [1]. The dataset collects observations on a variety of chemical features of Vinho Verde red wines, along with the median rating of those wines by at least three experts.

Our aims are to analyze what chemical properties contribute to wine quality and to develop a preliminary model for predicting the quality of a particular wine, given its properties. The analysis might be of interest to Vinho Verde vintners, who could use it to refine their techniques and processes; the model might be of interest to importers or merchants, who could use it to tailor their purchases. In order to make the results interpretable, we use a regression model, introducing non-linear terms sparingly.

Our preliminary model takes the form:

\[y \approx 0.05 + (0.35 \times alcohol) + (0.30 \times sulphates) -\\ (0.05 \times sulphates^2) - (0.23 \times volatile.acidity) - \\ (0.09 \times total.sulfur.dioxide) - (0.07 \times chlorides) + \epsilon,\]

Note that the response and all of the predictors are standardized. We see that two chemical properties, alcohol level and volatile acidity, exhibit a strong linear relationship with wine quality; the former has a positively association with quality, while the latter has a negative association. Two other features, total sulfur dioxide and chlorides, exhibit a weak negative relationship with quality. A fifth feature, sulphates, exhibits a strong but non-linear relationship with quality. The headline therefore reads: Higher-quality wines are associated with higher alcohol content, lower volatile acidity, and high (but not extreme) levels of sulphates.

One shouldn’t attempt to generalize these results too far. Notably, there are no wines in this dataset with a quality score greater than 8 (or less than 3), so it’s not possible to say what makes for a truly Vinho Verde red wine (or a truly terrible one). There is also no reason to expect that these results will hold for other varieties of wine. Nevertheless, the results may offer some guidance to vintners and merchants of Vinho Verde red wine.

Overview of the dataset

Let’s verify that there are no null values in the dataset. Then we can take a look at its shape.

sum(is.na(wine))
## [1] 0
str(wine[2:13])
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

We have nearly 1,600 observations of 11 independent variables and 1 response variable, quality. Quality is integer-valued – a rating from 1 to 10 – but all the other variables are continuous-valued. Some of the variables, such as pH, are fairly familiar; others, such as volatile acidity, are not.

We’ll take a look at summary statistics next:

summary(wine[2:13])
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The response variable, quality, has a minimum of 3 and a maximum of 8. In other words, the dataset doesn’t contain any extremely low-quality wines, nor does it contain any extremely low-quality wines. Most wines received either a 5 or a 6 rating. The narrow range suggests that it may prove difficult to tease out what makes for a good red wine.

With respect to the chemical properties, we observe some variables with a similary narrow range (e.g. citric acid, density), yet there are others that show a broader range. Total sulfur dioxide, for example, ranges across several orders of magnitude.

Univariate EDA

As a first pass, we’ll focus on the following variables:

  • quality
  • alcohol
  • pH
  • residual.sugar (which relates to sweetness)
  • volatile.acidity (which relates to bitterness)
  • chlorides (which relates to saltiness)
  • total.sulfur.dioxide (a preservative related to freshness)

To be sure, there are other variables of interest. In a later section, we’ll consider through more comprehensive and efficient means whether to pay attention to these.

Quality

ggplot(aes(x = quality), data = wine) +
  geom_histogram(color = I('Black'), fill = I('Blue')) +
  labs(list(title = "Quality ratings of wines", x = "Quality rating", y = "Count")) +
  scale_x_continuous(limits = c(0,10), breaks = seq(0,10,1))

summary(wine$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

As we saw above, quality is integer-valued. The distribution is fairly normal, with a mean of 5.64, a slightly negative skew, and short tails. Some wines are quite good, but not many; even fewer are quite bad.

Alcohol

ggplot(aes(x = alcohol), data = wine) +
  geom_histogram(color = I('Black'), fill = I('Blue')) +
  labs(list(title = "Alcohol content of wines", x = "Alcohol content (% abv)", y = "Count")) +
  scale_x_continuous(limits = c(8,15), breaks = seq(8,15,1))

summary(wine$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Most wines are in the 9%-12% alcohol by volume (abv) range, with a peak around 9% abv and a mean of 10.42% abv. The range is 8.4%-14.90%. The distribution is positively skewed.

In the greater context of wine, this histogram is surprising. Most red wines fall in the 12%-16% abv range, whereas the wines in this dataset are mostly in the 9%-12% abv range. It turns out, however, that such a range is common for Vinho Verde [2], so our data seem trustworthy. And since 14.9% is still a reasonable abv for red wine in general, we won’t exclude any of the high-abv wines here. That said, given the lower alcohol content of Vinho Verde, we should not presume to generalize any findings here to red wine in general.

pH

ggplot(aes(x = pH), data = wine) +
  geom_histogram(color = I('Black'), fill = I('Blue')) +
  labs(list(title = "pH of wines", x = "pH", y = "Count")) +
  scale_x_continuous(limits = c(2,5), breaks = seq(2,5,0.5))

summary(wine$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

This is a pretty solid normal distribution, with a median nearly equal to its mean.

Residual sugar

ggplot(aes(x = residual.sugar), data = wine) +
  geom_histogram(color = I('Black'), fill = I('Blue')) +
  labs(list(title = "Residual sugar of wines", 
            x = "Residual sugar (g/dm^3)", y = "Count")) +
  scale_x_continuous(limits = c(0,16), breaks = seq(0,16,4))

summary(wine$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Here, we see a more irregular distribution, with a long tail. We can see whether a log transformation of the residual.sugar variable yields a distribution that’s more normal in appearance:

ggplot(aes(x = log(residual.sugar)), data = wine) +
  geom_histogram(color = I('Black'), fill = I('Blue')) +
  labs(list(title = "Residual sugar of wines", 
            x = "log(Residual sugar [g/dm^3])", y = "Count"))

Indeed it does. We should consider such a transformation when building our model.

Volatile acidity

ggplot(aes(x = volatile.acidity), data = wine) +
  geom_histogram(color = I('Black'), fill = I('Blue')) +
  labs(list(title = "Volatile acidity of wines", 
            x = "Volatile acidity (g/dm^3)", y = "Count")) +
  scale_x_continuous(limits = c(0,1.8), breaks = seq(0,1.8,0.4))

summary(wine$volatile.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

This distribution is fairly normal, with some variability near the mean of the distribution. There is, again, a tail on the positive end of the distribution. Some of these values fall outside the U.S. legal limit for volatile acidity, which is 1.2 g/dm^3, but not so far beyond that limit as to deserve exclusion from the dataset.

Chlorides

ggplot(aes(x = chlorides), data = wine) +
  geom_histogram(color = I('Black'), fill = I('Blue')) +
  labs(list(title = "Chlorides of wines", 
            x = "Chlorides (g/dm^3)", y = "Count")) +
  scale_x_continuous(limits = c(0,0.7), breaks = seq(0,0.7,0.1))

ggplot(aes(x = log(chlorides)), data = wine) +
  geom_histogram(color = I('Black'), fill = I('Blue')) +
  labs(list(title = "Chlorides of wines", 
            x = "log(Chlorides [g/dm^3])", y = "Count")) 

summary(wine$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Here, the tail is very long, and it doesn’t disappear upon a log transformation.

For variables whose distributions have long tails, are these “extremes” intentional? Are some wines especially salty or sweet on purpose? Will the judges appreciate these choices, or are these extremes a sign of defects in the wine? Or do they not matter?

Total sulfur dioxide

ggplot(aes(x = total.sulfur.dioxide), data = wine) +
  geom_histogram(color = I('Black'), fill = I('Blue')) +
  labs(list(title = "Total sulfur dioxide of wines", 
            x = "Total sulfur dioxide (mg/dm^3)", y = "Count")) +
  scale_x_continuous(limits = c(0,300), breaks = seq(0,300,50))

ggplot(aes(x = log10(total.sulfur.dioxide)), data = wine) +
  geom_histogram(color = I('Black'), fill = I('Blue')) +
  labs(list(title = "Total sulfur dioxide of wines", 
            x = "log10(Total sulfur dioxide [mg/dm^3])", y = "Count")) 

summary(wine$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Here, we have a distribution that appears log-normal. There are a couple outliers on the positive tail, but they are within the U.S. legal limit of 350 mg/dm^3 [3].

Bivariate EDA

We now turn our attention to how chemical features relate to quality. We should also investigate whether any of the variables we haven’t explored merit a further look. One tacit assumption thus far has been that different “acid”-like variables were similar enough that choosing one representative variable was sufficient, and similary for “sulfur”-like variables. Let’s check those assumptions.

Scatterplot matrix

A more efficient way to determine which variables are correlated with each other and, more importantly, with our response variable, is to look at a scatterplot matrix. We’ll use the psych package to do so. We’ll also take this as a jumping off point into other bivariate analyses.

set.seed(1)
pairs.panels(sample_n(wine[2:13],1599),pch=".")

We learn many interesting things from this matrix. First, we get Pearson correlations amongst the “acid”-like variables and the “sulfur”-like variables.

  • Fixed.acidity and citric.acid are fairly well-correlated (r = 0.67), suggesting that we would be wise not to include both fixed.acidity and citric.acid in a linear regression model. But perhaps we should include neither, given that fixed.acidity is hardly correlated with quality and citric.acid has an unruly distribution. Meanwhile, volatile.acidity is moderately correlated with the unruly citric.acid (r = -0.55), but only weakly correlated with fixed.acidity.

  • With respect to “sulfur”-like variables, total.sulfur.dioxide and free.sulfur.dioxide are strongly correlated (r = 0.67), but neither is correlated with sulphates. Of these three, sulphates has the strongest correlation with quality. Yet it’s not a purely linear relationship, as we can see in the line of best fit.

More interesting, of course, is how all the variables relate to quality. We can classify variables as “moderate” or “weak” according to their corrrelation score.

Moderate correlations (0.3 < r < 0.5)

  • alcohol
  • volatile.acidity

Weak correlations (0.2 < r < 0.3)

  • sulphates (caveat: non-linear behavior)
  • citric.acid (caveat: non-normal distribution; correlation with volatile.acidity)

If we take log transformations of residual.sugar and total.sulfur.dioxide, these variables’ correlations with quality don’t move much.

cor(wine$quality, log(wine$total.sulfur.dioxide))
## [1] -0.1701427
cor(wine$quality, log(wine$residual.sugar))
## [1] 0.02353331

So, unless we find a reason to consider other variables later, we’ll stick with these four as our variables of interest – dropping pH, residual sugar, and chlorides – as we continue our exploratory analysis, taking note of the caveats above.

Next, let’s look at scatterplots of quality and each of our leading predictors in turn.

Quality vs. alcohol

ggplot(aes(x = alcohol, y = quality), data = wine) + 
  geom_jitter(alpha=0.5, color=I('Blue')) +
  geom_smooth(color=I('red')) +
  labs(list(title = "Quality vs. alcohol", 
            x = "Alcohol (% abv)", 
            y = "Quality"))

The correlation noted in the matrix is evident in the scatterplot too. An interesting structure to observe is that up to about alcohol = 10% abv, a significant majority of the wines have a quality score of less than 6, while beyond alcohol = 12% abv, a significant majority have a quality score of greater than 6, but that between these two values of alcohol, there is greater variability.

We also see the lone high-alcohol wine, with abv = 14.9%, dragging down the trend line. It would be interesting to collect more data on wines at this abv level.

Quality vs. volatile acidity

ggplot(aes(x = volatile.acidity, y = quality), data = wine) + 
  geom_jitter(alpha=0.5, color=I('Blue')) +
  geom_smooth(color=I('red')) +
  labs(list(title = "Quality vs. volatile acidity", 
            x = "Volatile acidity (g/dm^3)", 
            y = "Quality"))

We see evidence here of the negative trend already described. As volatile acidity increases, quality tends to decrease. For particularly low values of volatile acidity, the trend is reversed, and this variable becomes positively correlated with quality, suggesting that a little bit of acidity (or bitterness) is a good thing. Of course, this may just be an edge effect.

Quality vs. sulphates

ggplot(aes(x = sulphates, y = quality), data = wine) + 
  geom_jitter(alpha=0.5, color=I('Blue')) +
  geom_smooth(color=I('red')) +
  labs(list(title = "Quality vs. sulphates", 
            x = "Sulphates (mg/dm^3)", 
            y = "Quality"))

The weak linear trend indicated in the scatterplot matrix can now be recognized as a non-linear relationship. As the sulphates level increases from its minimum, the oxidation of the wine decreases, and quality improves. At a certain sulphate level, though, the trend reverses, and quality starts going down.

The non-linearity induced by high-sulphate values, which starts around sulphates = 0.9 g/dm^3, doesn’t appear to be of a few high-leverage observations. There are 59 wines with a sulphate values greater than or equal to 0.9, or roughly 7% of the dataset.

count(subset(wine, sulphates >= 0.9))
## Source: local data frame [1 x 1]
## 
##     n
## 1 110

This might mean the vintners intended to produce wines at this sulphate level but were unaware of the imapct on quality.

Quality vs. citric acid

ggplot(aes(x = citric.acid, y = quality), data = wine)  + 
  geom_jitter(alpha=0.5, color=I('Blue')) +
  geom_smooth(color=I('red')) +
  labs(list(title = "Quality vs. citric acid", 
            x = "Citric acid (g/dm^3)", 
            y = "Quality"))

Overall, we see the basic weak but positive trend that we expected. There are some unexpected “flatlines” in the curve here. Between citric acid values of 0.0 and 0.2, and again between 0.4 and 0.5, quality doesn’t seem to change much.

There are also some similarities to the quality vs. alcohol plot. There is a value of citric acid (~0.25 g/dm^3) below which high-quality wine are relatively uncommon. On the other end of the spectrum, there is a value of citric acid (~0.5 g/dm^3) above which low-quality wines are relatively uncommon.

Taken together, these scatterplots suggest that a single chemical factor is capable of causing a wine’s lack of extreme success or failure but that no factor on its own can guarantee a high-quality or low-quality wine.

In the next section, we’ll explore what balance of factors is needed to produce good Vinho Verde.

Multivariate EDA

We now bring a third dimension into our exploratory analysis. The continuous-valued nature of our independent variables will make a color-based encoding difficult to interpret, so let’s bucket each of them by quartile.

# Cut predictors into quartile-buckets
wine$alcohol.bucket <- cut(wine$alcohol, 
                           c(quantile(wine$alcohol)))

wine$volatile.acidity.bucket <- cut(wine$volatile.acidity,
                           c(quantile(wine$volatile.acidity)))

wine$sulphates.bucket <- cut(wine$sulphates,
                           c(quantile(wine$sulphates)))

wine$citric.acid.bucket <- cut(wine$citric.acid,
                           c(quantile(wine$citric.acid)))

wine$sulphates.bucket <- cut(wine$sulphates,
                           c(quantile(wine$sulphates)[1],
                             quantile(wine$sulphates)[2],
                             quantile(wine$sulphates)[3],
                             quantile(wine$sulphates)[4],
                             0.9, 2))

Now let’s see whether these multvariate plots reveal any additional structure.

Quality vs. alcohol and other variables

ggplot(aes(x = alcohol, y = quality), data = wine) +
  geom_jitter(alpha = 0.5, size = 3, aes(color = volatile.acidity.bucket)) +
  geom_smooth() +
  labs(list(title = "Quality vs. alcohol & volatile acidity", 
            x = "Alcohol (%abv)", 
            y = "Quality",
            color = "Volatile acidity (g/dm^3)"))

ggplot(aes(x = alcohol, y = quality), data = wine) +
  geom_jitter(alpha = 0.5, size = 3, aes(color = sulphates.bucket)) +
  geom_smooth() +
  labs(list(title = "Quality vs. alcohol & sulphates", 
            x = "Alcohol (%abv)", 
            y = "Quality",
            color = "Sulphates (g/dm^3)"))

ggplot(aes(x = alcohol, y = quality), data = wine) +
  geom_jitter(alpha = 0.5, size = 3, aes(color = citric.acid.bucket)) +
  geom_smooth() +
  labs(list(title = "Quality vs. alcohol & citric acid", 
            x = "Alcohol (%abv)", 
            y = "Quality",
            color = "Citric acid (g/dm^3)"))

As we might expect, the bucketed variables add extra explanatory power. Holding alcohol level fixed, we see the variability in quality explained by the respective buckets in the directions we found earlier. For example, in the first plot, holding alcohol level fixed, we see a lower volatile acidity associated, on average, with a higher quality score.

Quality vs. volatile and other variables

ggplot(aes(x = volatile.acidity, y = quality), data = wine) +
  geom_jitter(alpha = 0.5, size = 3, aes(color = alcohol.bucket)) +
  geom_smooth() +
  labs(list(title = "Quality vs. volatile acidity & alcohol", 
            x = "Volatile acidity (g/dm^3)", 
            y = "Quality",
            color = "Alcohol (%abv)"))

ggplot(aes(x = volatile.acidity, y = quality), data = wine) +
  geom_jitter(alpha = 0.5, size = 3, aes(color = sulphates.bucket)) +
  geom_smooth() +
  labs(list(title = "Quality vs. volatile acidity & sulphates", 
            x = "Volatile acidity (g/dm^3)", 
            y = "Quality",
            color = "Sulphates (g/dm^3)"))

ggplot(aes(x = volatile.acidity, y = quality), data = wine) +
  geom_jitter(alpha = 0.5, size = 3, aes(color = citric.acid.bucket)) +
  geom_smooth() +
  labs(list(title = "Quality vs. volatile acidity & citric acid", 
            x = "Volatile acidity (g/dm^3)", 
            y = "Quality",
            color = "Citric acid (g/dm^3)"))

Again, the weakly correlated varibles add extra explanatory power to the two-dimensional plots of quality vs. volatile acidity.

Averaging by quartile

At this point, it’s natural to simplify our plots even more through aggregation. Taking the “quality vs. alcohol & sulphates” plot, let’s see what happens to the quality-alcohol correlation when we average within sulphate quartiles.

wine.group_by_sulphates.bucket <- wine %>%
  filter(!is.na(sulphates.bucket)) %>%
  group_by(alcohol, sulphates.bucket) %>%
  summarise(mean_quality = mean(quality),
            n = n())

ggplot(aes(x = alcohol, y = mean_quality), data = wine.group_by_sulphates.bucket) +
  geom_jitter(alpha = 0.5, size = 3, aes(color = sulphates.bucket)) +
  geom_smooth() +
  labs(list(title = "Mean quality vs. alcohol & sulphates", 
            x = "Alcohol (% abv)", 
            y = "Mean quality",
            color = "Sulphates (g/dm^3)"))

Sulphates do clearly explain some of that variance. Most of the points above the smoothing curve are high-sulphate (but not too high!), and most of the points below are low-sulphate.

Let’s examine a similar plot, this time averaging within volatile acidity quartiles.

wine.group_by_volatile.acidity.bucket <- wine %>%
  filter(!is.na(volatile.acidity.bucket)) %>%
  group_by(alcohol, volatile.acidity.bucket) %>%
  summarise(mean_quality = mean(quality),
            n = n())

ggplot(aes(x = alcohol, y = mean_quality), 
       data = wine.group_by_volatile.acidity.bucket) +
  geom_jitter(alpha = 0.5, size = 3, aes(color = volatile.acidity.bucket)) +
  geom_smooth() +
  labs(list(title = "Mean quality vs. alcohol & volatile acidity", 
            x = "Alcohol (% abv)", 
            y = "Mean quality",
            color = "Volatile acidity (g/dm^3)"))

Again, a pattern emerges amidst the dots. Most of the points above the smoothing curve have a low volatile acidity, confirming the trend we saw in our bivariate analysis.

A line graph, though noisy, makes the trend stand out better:

ggplot(aes(x = alcohol, y = mean_quality), 
       data = wine.group_by_volatile.acidity.bucket) +
  geom_line(aes(color = volatile.acidity.bucket), size=2)  +
  labs(list(title = "Mean quality rating vs. alcohol content", 
            x = "Alcohol content (%abv)", y = "Quality rating",
            color = "Volatile acidity (g/dm^3)"))

If we smooth out noise by rounding alcohol values, then the pattern is instantly comprehensible.

ggplot(aes(x = round(alcohol), y = quality), data = wine) +
  geom_line(aes(color = volatile.acidity.bucket), 
            stat = "summary", fun.y = mean, size = 2) +
  labs(list(title = "Mean quality rating vs. alcohol content", 
            x = "Alcohol content (%abv)", y = "Quality rating",
            color = "Volatile acidity (g/dm^3)"))

On average, then, as alcohol goes up, and as volatile acidity goes down, quality increases (module edge effects).

Let’s see how sulphates behave in a plot of this type:

ggplot(aes(x = round(alcohol), y = quality), data = wine) +
  geom_line(aes(color = sulphates.bucket), 
            stat = "summary", fun.y = mean, size = 2) +
  labs(list(title = "Mean quality rating vs. alcohol content", 
            x = "Alcohol content (%abv)", y = "Quality rating",
            color = "Sulphates (g/dm^3)"))

Now we can see that a wine with a very high sulphate concentration (sulphates > 0.9 g/dm^3) is indeed of lower quality than moderately high-sulphate wine (0.73 < sulphates <= 0.9) across all alcohol levels, albeit not by much. In other words, too low a sulphate concentration may be a bad thing for freshness, but too high a concentration may have an adverse impact on quality too.

A basic model

We can now fit a basic model to the data. We’ll regress on all possible predictors, even ones we haven’t examined recently. First, we’ll standardize our feature set in order to make the regression coefficents more interpretable.

scalewine <- data.frame(scale(wine[, 2:13]))
lm.fit <- lm(quality ~ ., data=scalewine)
summary(lm.fit)
## 
## Call:
## lm(formula = quality ~ ., data = scalewine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3299 -0.4539 -0.0582  0.5597  2.5075 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -3.604e-16  2.007e-02   0.000   1.0000    
## fixed.acidity         5.388e-02  5.594e-02   0.963   0.3357    
## volatile.acidity     -2.403e-01  2.685e-02  -8.948  < 2e-16 ***
## citric.acid          -4.404e-02  3.550e-02  -1.240   0.2150    
## residual.sugar        2.851e-02  2.619e-02   1.089   0.2765    
## chlorides            -1.092e-01  2.444e-02  -4.470 8.37e-06 ***
## free.sulfur.dioxide   5.649e-02  2.812e-02   2.009   0.0447 *  
## total.sulfur.dioxide -1.330e-01  2.968e-02  -4.480 8.00e-06 ***
## density              -4.179e-02  5.056e-02  -0.827   0.4086    
## pH                   -7.908e-02  3.663e-02  -2.159   0.0310 *  
## sulphates             1.923e-01  2.400e-02   8.014 2.13e-15 ***
## alcohol               3.645e-01  3.495e-02  10.429  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8024 on 1587 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3561 
## F-statistic: 81.35 on 11 and 1587 DF,  p-value: < 2.2e-16

The features that have both a (relatively) high coefficient and a low p-value are alcohol, sulphates, volatile.acidity, chlorides, and total.sulfur.dioxide. The first three we expect; the other two are new to the party. (Note that citric acid, which we had been exploring as a predictor, has a high p-value.)

Why are chlorides and total.sulfur.dioxide showing up here as good predictors? If we return to our scatterplot matrix, we see that their correlation values are r = -0.13 and r = -0.19, respectively. Let’s examine the relation of each to quality more closely.

Here’s a plot of quality vs. chlorides:

ggplot(aes(x = chlorides, y = quality), data = wine) + 
  geom_jitter(alpha=0.5, color=I('Blue')) +
  geom_smooth(color=I('red')) +
  labs(list(title = "Quality vs. chlorides", 
            x = "Chlorides (g/dm^3)", 
            y = "Quality"))

From this plot, we would think that the “correlation” is due in large part to the extremes. We see a few data points around x = 0 pulling the fit up and a few high-leverage data points around x = 0.6 dragging it down. What does the fit look like when we remove these points?

wine2 <- wine[(wine$chlorides > min(wine$chlorides)) & 
               (wine$chlorides < 0.5), ]
ggplot(aes(x = chlorides, y = quality), data = wine2) + 
  geom_jitter(alpha=0.5, color=I('Blue')) +
  geom_smooth(color=I('red')) +
  labs(list(title = "Quality vs. chlorides", 
            x = "Chlorides (g/dm^3)", 
            y = "Quality"))

This looks more reasonable. Is the r-value for chlorides markedly different?

cor(wine2$quality, wine2$chlorides)
## [1] -0.1158525

It drops only by 0.02. What about the regression coefficient?

scalewine2 <- data.frame(scale(wine2[, 2:13]))
lm.fit2 <- lm(quality ~ ., data=scalewine2)
coef(lm.fit2)
##          (Intercept)        fixed.acidity     volatile.acidity 
##         5.993701e-17         3.544456e-02        -2.359438e-01 
##          citric.acid       residual.sugar            chlorides 
##        -3.778583e-02         2.412959e-02        -8.702992e-02 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##         5.941702e-02        -1.393855e-01        -2.593329e-02 
##                   pH            sulphates              alcohol 
##        -8.631344e-02         1.953509e-01         3.720488e-01

The p-value is still low, but the coefficient has gone from -1.1e-01 to -8.7e-02, suggesting that the strong presence of chlorides in the first model was artificial. The RSE and R-squared values are just as good, though, meaning that the second model is as good as the first.

Next, we want to run total.sulfur.dioxide through the same sanity check. The first step is to examine a plot of quality vs. total.sulfur.dioxide:

ggplot(aes(x = total.sulfur.dioxide, y = quality), data = wine) + 
  geom_jitter(alpha=0.5, color=I('Blue')) +
  geom_smooth(color=I('red')) +
  labs(list(title = "Quality vs. total sulfur dioxide", 
            x = "Total sulfur dioxide (g/dm^3)", 
            y = "Quality"))

Again, we see extreme values of the predictor driving the fit. Let’s remove data points where x > 200.

wine3 <- wine2[wine2$total.sulfur.dioxide < 200, ]
ggplot(aes(x = total.sulfur.dioxide, y = quality), data = wine3) + 
  geom_jitter(alpha=0.5, color=I('Blue')) +
  geom_smooth(color=I('red')) +
  labs(list(title = "Quality vs. total sulfur dioxide", 
            x = "Total sulfur dioxide (g/dm^3)", 
            y = "Quality"))

Much more reasonable. How does this affect r?

cor(wine3$quality, wine3$total.sulfur.dioxide)
## [1] -0.2113839

Its magnitude increases (gets more negative) by 0.02. How about the regression coefficient?

scalewine3 <- data.frame(scale(wine3[, 2:13]))
lm.fit3 <- lm(quality ~ ., data=scalewine3)
coef(lm.fit3)
##          (Intercept)        fixed.acidity     volatile.acidity 
##         1.172834e-15         2.420604e-02        -2.335403e-01 
##          citric.acid       residual.sugar            chlorides 
##        -3.814583e-02         1.266170e-02        -8.684811e-02 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##         7.189222e-02        -1.586297e-01        -8.278799e-03 
##                   pH            sulphates              alcohol 
##        -8.776426e-02         1.979463e-01         3.732353e-01

The regression coefficient drops a little but remains the same order of magnitude.

Taking a cue from backward selection, we’ll now remove all the other predictors – including chlorides – from the multiple regression model and see where we stand. Since none of our predictors are pairwise correlated, we won’t worry about interaction terms.

lm.fit2 <- lm(quality ~ alcohol + sulphates + volatile.acidity + 
                total.sulfur.dioxide, data=scalewine2)
summary(lm.fit2)
## 
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity + 
##     total.sulfur.dioxide, data = scalewine2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4111 -0.4712 -0.0781  0.5570  2.6260 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.597e-16  2.025e-02   0.000        1    
## alcohol               3.843e-01  2.114e-02  18.176  < 2e-16 ***
## sulphates             1.681e-01  2.113e-02   7.952 3.46e-15 ***
## volatile.acidity     -2.597e-01  2.142e-02 -12.121  < 2e-16 ***
## total.sulfur.dioxide -9.531e-02  2.080e-02  -4.583 4.94e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8088 on 1590 degrees of freedom
## Multiple R-squared:  0.3475, Adjusted R-squared:  0.3459 
## F-statistic: 211.7 on 4 and 1590 DF,  p-value: < 2.2e-16

All four predictors have a very low p-value and regression coefficients that, while not enormous, could contribute collectively to a not insignificant improvement in quality. To put it in more intelligible terms, if a vintner tweaked each of these four chemical factors by 1 standard deviation (that is, 1 unit in this standardized scale), we could expect the quality rating to increase up `0.38 + 0.17 + |-0.26| + |-0.11| = 0.92 – almost a full quality point!

A couple troublesome points remain. First, the intercept for all of our model iterations thus far has not been significant. Really not significant: p = 1. So, although we are getting a good sense for how our predictors influence wine quality, if we wanted to use this model to actually predict how a judge would rate a new wine, we would be in trouble, because we don’t have a baseline for the intercept in the model. Another trouble is that we know that our sulphates feature has a non-linear relationship with quality. If we return to the quality vs. sulphates plot, we see that quality is positively correlated with sulphates, until about 0.8 mg/dm^3, at which point quality begins to dip. We have enough data points for sulphates > 0.8 to believe that this effect is real.

How would our model change if we allowed a second-order sulphates term? If we try to account for that non-linearity in one of our most important predictors, will we recover a reliable intercept?

lm.fit4 <- lm(quality ~ alcohol + sulphates + volatile.acidity +
               chlorides + total.sulfur.dioxide + 
               I(sulphates^2), data=scalewine2)
summary(lm.fit4)
## 
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity + 
##     chlorides + total.sulfur.dioxide + I(sulphates^2), data = scalewine2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4660 -0.4613 -0.0664  0.5522  2.4897 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.04541    0.02141   2.121  0.03410 *  
## alcohol               0.35081    0.02170  16.163  < 2e-16 ***
## sulphates             0.30464    0.02902  10.497  < 2e-16 ***
## volatile.acidity     -0.22912    0.02161 -10.601  < 2e-16 ***
## chlorides            -0.06813    0.02218  -3.072  0.00216 ** 
## total.sulfur.dioxide -0.08518    0.02063  -4.129 3.83e-05 ***
## I(sulphates^2)       -0.04544    0.00772  -5.886 4.82e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7978 on 1588 degrees of freedom
## Multiple R-squared:  0.366,  Adjusted R-squared:  0.3636 
## F-statistic: 152.8 on 6 and 1588 DF,  p-value: < 2.2e-16

Now w have an intercept that is statistically significant (with p < 0.05), a model of quality that’s linear in alcohol level, volatile acidity, and chlorides, and quadratic in sulphates. We could do k-fold cross-validation and compares the test error (RSS) of the quadratic model to our top-performing linear model, but a quicker way to compare the two is to run an ANOVA:

anova(lm.fit2, lm.fit4)
## Analysis of Variance Table
## 
## Model 1: quality ~ alcohol + sulphates + volatile.acidity + total.sulfur.dioxide
## Model 2: quality ~ alcohol + sulphates + volatile.acidity + chlorides + 
##     total.sulfur.dioxide + I(sulphates^2)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1590 1040.0                                  
## 2   1588 1010.6  2    29.359 23.066 1.335e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA tells us with great confidence that the polynomial model outperforms the original model. For a more accurate estimate of the standard error associated with each coefficient in the model, we can do some bootstrapping:

library(boot)
boot_coef <- function(data, index){
  mylm <- lm(quality ~ alcohol + sulphates + volatile.acidity + 
               chlorides + total.sulfur.dioxide +
               I(sulphates^2), data=data, subset=index)
  return(coef(mylm))
}
boot(scalewine2, boot_coef, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = scalewine2, statistic = boot_coef, R = 1000)
## 
## 
## Bootstrap Statistics :
##        original        bias    std. error
## t1*  0.04541185  3.025494e-03  0.02260384
## t2*  0.35081308 -9.078001e-04  0.02631022
## t3*  0.30464538  1.980977e-03  0.02869558
## t4* -0.22911547 -7.840295e-05  0.02487511
## t5* -0.06812819  7.593913e-04  0.02217671
## t6* -0.08517552  7.897571e-05  0.02241710
## t7* -0.04544033 -2.866906e-03  0.01094401

These standard errors are higher than the errors in the original model fit, but this is to be expected, given the assumption-breaking nonlinearity in the model and the fact that we fit the model on the entirety of the training data.

Conclusion

One of the main challenges with this data set is the narrow range of quality ratings available and the discrete values that these quality scores take on. It would be instructive to analyze more extreme ratings (i.e. wines with ratings of 1, 2, 9, or 10) as well as more nuanced ratings (perhaps by taking the mean of the experts’ ratings rather than the median).

Nevertheless, after exploring the wines in this dataset and fitting a regression model, we found two chemical features, alcohol content and volatile acidity, that seem to be strong predictors for wine quality. Two other features, total sulfur dioxide and chlorides, are weaker predictors. A fifth feature, sulphates, is a strong predictor, although its effects are non-linear. Sulphates, which are used to preserve a wine’s freshness, may have a “sweet spot”; too low a level of sulphates, and the wine doesn’t taste fresh, but too high a level of sulphates, and it tastes overly preserved. Given more data and/or more varied data, we might discover similar sweet spots in the other features.

Finally, we note that Cortez et al. developed their own predictive model of wine quality. Our model and theirs share four of five top predictors, though their model assign different weights to each feature. In addition, their model excludes chlorides as a feature and includes pH. Investigating the precise nature of these discrepancies is a subject for future exploration.

References

  1. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
  2. http://en.wikipedia.org/wiki/Vinho_Verde
  3. http://waterhouse.ucdavis.edu/whats-in-wine/